Is there a competitive advantage in how businesses manage their key assets? Predicting failures directly impacts the bottom line. Consider the following:
On the other hand, predictions are not perfect.
In this analysis, I assign costs to these prediction scenarios in order to find an “optimal” solution that works for the business.
This data set is the Kaggle version of the well-known public NASA data set for asset degradation modeling. We are given data for 100 simulated jet engines, each with over 20 sensor readings contaminated with considerable noise.
The goal is to predict the “Remaining Useful Life” given sensor readings for a particular engine cycle (possibly equivalent to flights or days).
The dataset granularity is engine (unit_number) and cycle. Thus, one engine could have many records, each record pertaining to a specific cycle.
Each engine in the training dataset is run-to-failure. Thus we can precisely calculate the RUL for every cycle. However, the validation dataset has engines that are observed at a specific point in time; some are still running, and some have failed.
Step 1: Simplify the problem for a quick win. Instead of predicting RUL directly, can we predict which engines have less than 30 cycles remaining before failure? This turns the regression problem into a classification problem. I came to this conclusion after exploring the data and seeing a clear demarcation between sensor values on either side of this boundary. Presumably, 30 cycles would be enough lead time to secure parts needed for maintenance; this is where the business would need to weigh in. Certainly, the more lead time the better, but more lead time also means less confidence in the prediction; it's a balance.
I used two different classification algorithms: Logistic Regression and eXtreme Gradient Boosting (XGBoost). Each has a different optimization method and may lead to different results depending on the data.
Step 2: Once we have learnings from this, predict the RUL itself, which will consider the sequential component of our data. This will require more advanced neural network architectures such as LSTM (Long Short-Term Memory). I will tackle this in another script.
The dataset can be found here:
https://www.kaggle.com/datasets/behrad3d/nasa-cmaps
https://www.nasa.gov/content/prognostics-center-of-excellence-data-set-repository
First, the majority of the training data represented “no failure” cycles, as expected. Most machine learning algorithms work best when the number of observations in each class (failures vs. no failures) is about the same; otherwise, we could get very high accuracy just by predicting the majority class. Thus, I downsampled the majority class to approximately the same size as the failure class.
Second, the algorithms I used provided a probability of failure. This is where I introduced the business cost of an unplanned failure (false negatives) vs prematurely predicting failure (false positives). These costs can radically change the outcome so these need to be considered carefully.
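To make this trade-off concrete, here is a minimal sketch of an expected-cost comparison, using the same unit costs as the metrics function later in this script (20 per false negative, 0.5 per false positive) but with hypothetical confusion-matrix counts:

```r
# Unit costs: a missed failure (false negative) is far worse than
# premature maintenance (false positive)
fn_cost <- 20
fp_cost <- 0.5

# Hypothetical counts for two candidate decision thresholds
cost_a <- 10 * fn_cost + 200 * fp_cost  # threshold A: 10 FN, 200 FP
cost_b <- 2  * fn_cost + 400 * fp_cost  # threshold B: 2 FN, 400 FP

c(cost_a, cost_b)  # 300 240
```

Even though threshold B doubles the false positives, its total cost is lower because missed failures dominate; this is why the cost ratio must come from the business.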
Third, I incorporated survival curves from the statistical field of “survival analysis”. I did this because these curves are a powerful way to share the big picture with business customers. They actually are a model as well, but, I’m using them in a descriptive vs predictive way for visualizing engine deterioration over time.
Results
For the machine learning algorithms, the metrics we care about are Sensitivity and Specificity.
Sensitivity: When the engines actually failed within 30 cycles, how often was the prediction “yes”?
Specificity: When the engines did not fail within 30 cycles, how often was the prediction “no”?
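As a quick illustration (toy counts, not the model's results), both metrics fall directly out of the confusion matrix:

```r
# Hypothetical confusion-matrix counts
tp <- 90;  fn <- 10   # actual failures within 30 cycles
tn <- 160; fp <- 40   # actual non-failures

sensitivity <- tp / (tp + fn)  # 0.9 - share of true failures caught
specificity <- tn / (tn + fp)  # 0.8 - share of non-failures left alone
```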
Compare the cost of doing nothing (run to failure) against predicting and preventing failures with machine learning. A disproportionate cost (penalty) is applied to each false negative. The units are not important; we are looking at the relative cost between these approaches. The machine learning approaches offer significant savings.
# Package names
packages <- c("tidyverse", # wrangling
"corrplot", # correlation plot
"ggplot2" , # plots
"parsnip", # wrapper package for modeling
"recipes", # Preprocessing & Sampling
"rsample", # Preprocessing & Sampling
"rpart.plot", # plot simple Decision Tree
"tidyquant", # theme for visuals
"ROCR" , # needed for ROC curve
"themis" , # downsample
"survival", # survival analysis
"survminer", # survival plot
"xgboost", # algorithm
"dplyr", # wrangling
"correlationfunnel", # correlations, automatic binning
"plotly", # visualization
"caret" , # confusion matrix
"kableExtra" # Data Formatting
)
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
# Packages loading
invisible(lapply(packages, library, character.only = TRUE))
identifiers <- c('unit_number', 'time_cycles')
setting_names <- c('setting_1', 'setting_2', 'setting_3')
sensor_names <- paste0('sensor_', 1:21)
col_names <- c(identifiers, setting_names, sensor_names)
file_path <- "data/prevMtce/RUL_NASA_Engines"
file_train <- "train_FD001.txt"
file_test <- "test_FD001.txt"
file_RUL <- "RUL_FD001.txt"
train_raw <- read_delim(paste0(file_path,'/',file_train), delim = " ", col_names = col_names, show_col_types = FALSE)
valid_raw <- read_delim(paste0(file_path,'/',file_test), delim = " ", col_names = col_names, show_col_types = FALSE)
valid_answer_raw <- read_delim(paste0(file_path,'/',file_RUL), delim = " ", col_names = "rul_last", show_col_types = FALSE)
train_raw_tbl <- tibble(train_raw)
valid_raw_tbl <- tibble(valid_raw)
valid_answer_raw_tbl <- tibble(valid_answer_raw)
# remove extraneous cols, add new helper cols
train_tbl <- train_raw_tbl %>%
select(everything(), -starts_with("X") ) %>%
mutate(count_me = 1
,time_cycles_norm = time_cycles
)
valid_tbl <- valid_raw_tbl %>%
select(everything(), -starts_with("X") ) %>%
mutate(count_me = 1
,time_cycles_norm = time_cycles)
# add row number as "unit number" as the original data was just the RUL with no identifier
valid_answer_tbl <- valid_answer_raw_tbl %>%
select(everything(), -starts_with("X") ) %>%
mutate(unit_number = row_number())
train_rul_tbl <- train_tbl %>%
group_by(unit_number) %>%
mutate(max_cycles = max(time_cycles),
rul = max_cycles - time_cycles
) %>%
ungroup()
valid_max_tbl <- valid_tbl %>%
group_by(unit_number) %>%
mutate(max_cycles = max(time_cycles)
) %>%
ungroup() %>%
# join to valid_answer_tbl
left_join(valid_answer_tbl) %>%
# calculate rul
mutate(
rul = rul_last + (max_cycles - time_cycles)
) %>%
select(-rul_last)
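As a sanity check on the validation RUL formula above (hypothetical numbers): if an engine's last observed cycle is 200 and the answer file says 50 cycles remained at that point, then RUL at any earlier cycle counts back from the last observation:

```r
# Hypothetical validation engine
max_cycles <- 200  # last observed cycle
rul_last   <- 50   # RUL at the last observation (from the answer file)

time_cycles <- c(150, 180, 200)
rul <- rul_last + (max_cycles - time_cycles)
rul  # 100 70 50
```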
train_max_cycles <- train_rul_tbl %>%
select( unit_number
,max_cycles
) %>%
distinct()
valid_max_cycles <- valid_max_tbl %>%
select( unit_number
,max_cycles
) %>%
distinct()
paste("The number of nulls in train:",sum(is.na(train_rul_tbl)), ", the number of nulls in validation:", sum(is.na(valid_max_tbl)) )
## [1] "The number of nulls in train: 0 , the number of nulls in validation: 0"
## [1] "There are 20631 rows, and 30 columns in train_rul_tbl"
## [1] "There are 13096 rows, and 30 columns in valid_max_tbl"
glimpse(train_rul_tbl)
## Rows: 20,631
## Columns: 30
## $ unit_number <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ time_cycles <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16~
## $ setting_1 <dbl> -0.0007, 0.0019, -0.0043, 0.0007, -0.0019, -0.0043, 0~
## $ setting_2 <dbl> -4e-04, -3e-04, 3e-04, 0e+00, -2e-04, -1e-04, 1e-04, ~
## $ setting_3 <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100~
## $ sensor_1 <dbl> 518.67, 518.67, 518.67, 518.67, 518.67, 518.67, 518.6~
## $ sensor_2 <dbl> 641.82, 642.15, 642.35, 642.35, 642.37, 642.10, 642.4~
## $ sensor_3 <dbl> 1589.70, 1591.82, 1587.99, 1582.79, 1582.85, 1584.47,~
## $ sensor_4 <dbl> 1400.60, 1403.14, 1404.20, 1401.87, 1406.22, 1398.37,~
## $ sensor_5 <dbl> 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.62, 14.6~
## $ sensor_6 <dbl> 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.61, 21.6~
## $ sensor_7 <dbl> 554.36, 553.75, 554.26, 554.45, 554.00, 554.67, 554.3~
## $ sensor_8 <dbl> 2388.06, 2388.04, 2388.08, 2388.11, 2388.06, 2388.02,~
## $ sensor_9 <dbl> 9046.19, 9044.07, 9052.94, 9049.48, 9055.15, 9049.68,~
## $ sensor_10 <dbl> 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3~
## $ sensor_11 <dbl> 47.47, 47.49, 47.27, 47.13, 47.28, 47.16, 47.36, 47.2~
## $ sensor_12 <dbl> 521.66, 522.28, 522.42, 522.86, 522.19, 521.68, 522.3~
## $ sensor_13 <dbl> 2388.02, 2388.07, 2388.03, 2388.08, 2388.04, 2388.03,~
## $ sensor_14 <dbl> 8138.62, 8131.49, 8133.23, 8133.83, 8133.80, 8132.85,~
## $ sensor_15 <dbl> 8.4195, 8.4318, 8.4178, 8.3682, 8.4294, 8.4108, 8.397~
## $ sensor_16 <dbl> 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03,~
## $ sensor_17 <dbl> 392, 392, 390, 392, 393, 391, 392, 391, 392, 393, 392~
## $ sensor_18 <dbl> 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388, 2388,~
## $ sensor_19 <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100~
## $ sensor_20 <dbl> 39.06, 39.00, 38.95, 38.88, 38.90, 38.98, 39.10, 38.9~
## $ sensor_21 <dbl> 23.4190, 23.4236, 23.3442, 23.3739, 23.4044, 23.3669,~
## $ count_me <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ time_cycles_norm <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16~
## $ max_cycles <dbl> 192, 192, 192, 192, 192, 192, 192, 192, 192, 192, 192~
## $ rul <dbl> 191, 190, 189, 188, 187, 186, 185, 184, 183, 182, 181~
create_histogram <- function(data, main_title="title", xlab="xlab", breaks=20) {
hist(data
, main = main_title
, xlab = xlab
, breaks = breaks)
# Calculate the average and median
average <- mean(data)
average_str <- as.character(round(average, 1))
median <- median(data)
median_str <- as.character(round(median, 1))
# Add vertical lines for the average and median
abline(v = average, col = "orange", lwd = 2)
abline(v = median, col = "blue", lwd = 2)
legend("topright", legend = c(paste("Average",average_str), paste("Median",median_str)),
col = c("orange", "blue"), lwd = 2)
}
create_histogram(train_max_cycles$max_cycles,main_title="Training Data: Max Cycle at Failure", xlab="Cycles",breaks = seq(min(train_max_cycles$max_cycles) - 30, max(train_max_cycles$max_cycles) + 30, by = 10) )
create_histogram(valid_max_cycles$max_cycles,main_title="Validation Data: Max Cycle Given (Not Necessarily a Failure)", xlab="Cycles", breaks = seq(min(valid_max_cycles$max_cycles) - 30, max(valid_max_cycles$max_cycles) + 30, by = 10))
# Create a new plot with 1 row and 3 columns
par(mfrow = c(1, 3))
setting1_hist <- hist(train_rul_tbl$setting_1, main = "Setting 1", xlab = "")
setting2_hist <- hist(train_rul_tbl$setting_2, main = "Setting 2", xlab = "")
setting3_hist <- hist(train_rul_tbl$setting_3, main = "Setting 3", xlab = "")
train_rul_tbl %>%
select(unit_number, setting_1, setting_2, setting_3, count_me) %>%
distinct() %>%
group_by(unit_number) %>%
summarise(sum_distinct_settings = sum(count_me))
## # A tibble: 100 x 2
## unit_number sum_distinct_settings
## <dbl> <dbl>
## 1 1 170
## 2 2 242
## 3 3 165
## 4 4 166
## 5 5 219
## 6 6 174
## 7 7 217
## 8 8 133
## 9 9 176
## 10 10 193
## # i 90 more rows
plot_hist_facet <- function(data, bins = 10, ncol = 5,
fct_reorder = FALSE, fct_rev = FALSE,
fill = "gray", #palette_light()[[3]],
color = "white", scale = "free") {
data_factored <- data %>%
mutate_if(is.character, as.factor) %>%
mutate_if(is.factor, as.numeric) %>%
gather(key = key, value = value, factor_key = TRUE)
if (fct_reorder) {
data_factored <- data_factored %>%
mutate(key = as.character(key) %>% as.factor())
}
if (fct_rev) {
data_factored <- data_factored %>%
mutate(key = fct_rev(key))
}
g <- data_factored %>%
ggplot(aes(x = value, group = key)) +
geom_histogram(bins = bins, fill = fill, color = color) +
facet_wrap(~ key, ncol = ncol, scales = scale) +
theme_tq()
return(g)
}
skewed_feature_names <- train_rul_tbl %>%
select_if(is.numeric) %>%
map_df(skewness) %>%
gather(factor_key = T) %>%
arrange(desc(abs(value))) %>%
kbl(caption = "Skew") %>%
kable_classic(full_width = F, html_font = "Cambria")
skewed_feature_names
| key | value |
|---|---|
| sensor_6 | -6.9163101 |
| sensor_9 | 2.5551791 |
| sensor_14 | 2.3723811 |
| max_cycles | 0.8887062 |
| time_cycles | 0.4998676 |
| time_cycles_norm | 0.4998676 |
| rul | 0.4998676 |
| sensor_8 | 0.4793760 |
| sensor_13 | 0.4697583 |
| sensor_11 | 0.4692950 |
| sensor_4 | 0.4431621 |
| sensor_12 | -0.4423751 |
| sensor_7 | -0.3943003 |
| sensor_15 | 0.3882304 |
| sensor_20 | -0.3584191 |
| sensor_17 | 0.3531000 |
| sensor_21 | -0.3503495 |
| sensor_2 | 0.3165029 |
| sensor_3 | 0.3089233 |
| unit_number | -0.0678103 |
| setting_1 | -0.0247645 |
| setting_2 | 0.0090845 |
| setting_3 | NaN |
| sensor_1 | NaN |
| sensor_5 | NaN |
| sensor_10 | NaN |
| sensor_16 | NaN |
| sensor_18 | NaN |
| sensor_19 | NaN |
| count_me | NaN |
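The NaN rows are expected: those sensors and settings are constant in this subset, and skewness divides by a standard deviation of zero. A base-R illustration of one common skewness convention (not the PerformanceAnalytics implementation):

```r
# Skewness ~ third standardized moment; undefined when sd is zero
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(rep(518.67, 3))     # NaN - constant, like sensor_1
skew(c(1, 2, 3, 4, 100)) # strongly positive - long right tail
```

These zero-variance columns carry no predictive signal, which is why they are dropped before modeling.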
# Function to create line charts with moving average
create_line_chart <- function(data, column_names, descriptive_names, unit_numbers, window_size = 10) {
# Filter the data based on unit_numbers
filtered_data <- data %>%
filter(unit_number %in% unit_numbers)
# Loop through the columns
for (i in seq_along(column_names)) {
# Get the column name and its descriptive name
column <- column_names[i]
descriptive_name <- descriptive_names[i]
# Create a new data frame for storing the moving averages
moving_averages <- data.frame()
# Loop through the unit_numbers
for (unit in unit_numbers) {
# Filter the data for the current unit_number
unit_data <- filtered_data %>%
filter(unit_number == unit)
# Calculate moving average for the current unit
moving_average <- rollmean(unit_data[[column]], k = window_size, fill = NA, align = "right")
# Create a new data frame for the current unit's moving average
unit_moving_average <- data.frame(rul = unit_data$rul, moving_average, unit_number = unit)
# Add the unit_moving_average to the overall moving_averages data frame
moving_averages <- bind_rows(moving_averages, unit_moving_average)
}
# Create the line chart with moving averages
p <- ggplot(moving_averages, aes(x = rul, y = moving_average, color = factor(unit_number))) +
geom_line() +
geom_vline(xintercept = 30, col = "gray", lwd = 1) +
labs(title = paste(descriptive_name, "Window Size for Moving Avg:", window_size), x = "RUL", y = descriptive_name) +
theme_minimal() +
scale_x_continuous(trans = "reverse") # Reverse the x-axis ticks
# Display the chart
print(p)
}
}
# Select random 10 units to show in line charts
set.seed(42) # Set seed for reproducibility
sample_nums <- sample(1:100, 10)
unit_number_list <- as.character(sample_nums)
# create vector of column names that have no variance
no_var_colnames <- c("unit_number", "sensor_1", "sensor_5", "sensor_6", "sensor_10", "sensor_16", "sensor_18", "sensor_19", "setting_3", "rul", "count_me", "time_cycles", "max_cycles", "fail_in_30")
column_names <- colnames(train_rul_tbl)
cols_for_chart <- column_names[!(column_names %in% no_var_colnames)]
# Call the function
create_line_chart(data = train_rul_tbl,
column_names = cols_for_chart,
descriptive_names = cols_for_chart,
unit_numbers = unit_number_list,
window_size = 10
)
sample_nums <- 1:100
unit_number_list <- as.character(sample_nums)
cols_for_chart <- "sensor_9"
create_line_chart(data = train_rul_tbl,
column_names = cols_for_chart,
descriptive_names = cols_for_chart,
unit_numbers = unit_number_list,
window_size = 10
)
train_rul_tbl <- train_rul_tbl %>%
mutate(fail_in_30 = as.factor(case_when(rul <= 30 ~ 1,
TRUE ~ 0)))
#train
pct<-round(table(train_rul_tbl$fail_in_30)/length(train_rul_tbl$fail_in_30)*100,1)
labs<-c("No-Fail-30", "Fail-30")
labs<-paste(labs,pct)
labs<-paste(labs, "%", sep="")
pie(table(train_rul_tbl$fail_in_30),labels=labs,col=c("gray", "navy"),main="Train: %Fail")
valid_max_tbl <- valid_max_tbl %>%
mutate(fail_in_30 = as.factor(case_when(rul <= 30 ~ 1,
TRUE ~ 0)))
pct<-round(table(valid_max_tbl$fail_in_30)/length(valid_max_tbl$fail_in_30)*100,1)
labs<-c("No-Fail-30", "Fail-30")
labs<-paste(labs,pct)
labs<-paste(labs, "%", sep="")
pie(table(valid_max_tbl$fail_in_30),labels=labs,col=c("gray", "navy"),main="Validation:%Fail")
set.seed(42)
desired_minority_count <- round(nrow(train_rul_tbl) * 0.15)
# Sample the majority group down to the minority-class count
sampled_majority <- train_rul_tbl %>%
filter(fail_in_30 == 0) %>%
slice_sample(n = desired_minority_count, replace = FALSE)
# Sample the minority group
sampled_minority <- train_rul_tbl %>%
filter(fail_in_30 == 1) %>%
slice_sample(n = desired_minority_count, replace = FALSE)
# Combine the sampled majority and minority groups
train_sampled_tbl <- bind_rows(sampled_majority, sampled_minority)
#train
pct<-round(table(train_sampled_tbl$fail_in_30)/length(train_sampled_tbl$fail_in_30)*100,1)
labs<-c("No-Fail-30", "Fail-30")
labs<-paste(labs,pct)
labs<-paste(labs, "%", sep="")
pie(table(train_sampled_tbl$fail_in_30),labels=labs,col=c("gray", "navy"),main="Percent Fail in 30")
print(paste("After downsampling, there are" , as.character(nrow(train_sampled_tbl)), "rows, and", as.character(ncol(train_sampled_tbl)), "columns in",deparse(substitute(train_sampled_tbl)) ))
## [1] "After downsampling, there are 6190 rows, and 31 columns in train_sampled_tbl"
summary(train_sampled_tbl$fail_in_30)
## 0 1
## 3095 3095
# ?recipe
train_predictors <- train_sampled_tbl %>%
select(-unit_number,-count_me,-max_cycles, -time_cycles, -rul)
recipe_scaled_obj <- recipe(fail_in_30 ~ . , data=train_predictors) %>%
step_nzv(all_predictors()) %>%
step_center(all_predictors()) %>%
step_scale( all_predictors()) %>%
prep()
train_scaled_tbl <- bake(recipe_scaled_obj, new_data = train_predictors)
train_scaled_tbl %>%
select(setting_1, setting_2, sensor_2) %>%
plot_hist_facet(bins = 10, ncol = 3, fct_rev = F)
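step_center() and step_scale() standardize each predictor to mean 0 and standard deviation 1; base R's scale() does the equivalent arithmetic, which is a quick way to check the recipe's effect:

```r
# A few raw sensor_2-style readings (hypothetical)
x <- c(641.82, 642.15, 642.35, 642.37)
x_scaled <- as.numeric(scale(x))  # (x - mean(x)) / sd(x)

round(mean(x_scaled), 10)  # 0
round(sd(x_scaled), 10)    # 1
```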
valid_sampled1_tbl <- valid_max_tbl %>%
select(-unit_number,-count_me,-max_cycles, -time_cycles, -rul)
valid_scaled_tbl <- bake(recipe_scaled_obj, new_data = valid_sampled1_tbl)
create_boxplots <- function(data, column_names, display_raw_data = FALSE) {
# Convert "fail_in_30" to a factor if it's not already
if (!is.factor(data$fail_in_30)) {
data$fail_in_30 <- factor(data$fail_in_30)
}
# Loop through each column and create box plots
for (column in column_names) {
# Create a title based on the column name
title <- paste("Box Plot -", column)
# Create the box plot
boxplot(data[[column]] ~ data$fail_in_30, main = title, xlab = "fail in 30")
# Display raw data as points if specified
if (display_raw_data) {
points(data$fail_in_30, data[[column]], col = "lightgray", pch = 16)
}
}
}
# par(mfrow = c(3, 2))
cols_for_chart <- c( "sensor_4", "sensor_7", "sensor_11","sensor_12", "sensor_15")
create_boxplots(data = train_scaled_tbl,
column_names = cols_for_chart,
display_raw_data = FALSE
)
cooksd <- cooks.distance(glm(fail_in_30 ~ .,
family = "binomial",
data = train_scaled_tbl ))
outliers <- rownames(train_scaled_tbl[cooksd > 4*mean(cooksd, na.rm=T), ])
outliers_title <- length(outliers)
plot(cooksd,
pch="*",
cex=1,
main= paste("Influential Points by Cooks distance (above red line),N=", outliers_title)
)
abline(h = 4*mean(cooksd, na.rm=T), col="red", lwd = 4)
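The 4 x mean(cooksd) cutoff is a common rule of thumb rather than a formal test. The same flagging logic on a small self-contained toy regression (hypothetical data):

```r
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)
y[20] <- 100  # inject one influential point

cooks <- cooks.distance(lm(y ~ x))
flagged <- which(cooks > 4 * mean(cooks))
flagged  # the injected point (index 20) stands out
```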
# Correlation Funnel
train_scaled_tbl %>%
# binarize puts continuous and categorical values into bins
binarize() %>%
# correlate fail_in_30 == 1 against all other variables
correlate(fail_in_30__1) %>%
plot_correlation_funnel(interactive = TRUE, alpha = 0.7)
# prepare data
correlation_tbl <- train_scaled_tbl %>%
rename_with(~ str_replace(., "^sensor_", "s"), starts_with("sensor_")) %>%
rename_with(~ str_replace(., "^setting_", "set"), starts_with("setting_")) %>%
mutate(fail_in_30_num = as.numeric(fail_in_30)) %>%
select(-fail_in_30)
corr_matrix <- cor(correlation_tbl)
# Create a correlation plot with color shading
corrplot(corr_matrix, method = "color", type = "lower",
tl.col = "black", tl.srt = 45, diag = FALSE)
threshold <- 0.75
cor_table <- correlation_tbl %>%
select_if(is.numeric) %>%
cor() %>%
as.data.frame() %>%
rownames_to_column(var = "variable1") %>%
gather(variable2, correlation, -variable1) %>%
mutate(correlation = round(correlation, 2))
# Filter correlations above the threshold
cor_table %>%
filter( (variable1 != variable2) & (variable1=="fail_in_30_num") & (abs(correlation) >= threshold) ) %>%
arrange(desc(abs(correlation))) %>%
kbl(caption = "Correlations >= 075") %>%
kable_classic(full_width = F, html_font = "Cambria")
| variable1 | variable2 | correlation |
|---|---|---|
| fail_in_30_num | s11 | 0.79 |
| fail_in_30_num | s4 | 0.78 |
| fail_in_30_num | s12 | -0.77 |
| fail_in_30_num | s7 | -0.75 |
| fail_in_30_num | s15 | 0.75 |
set.seed(42)
model_decision_tree <- decision_tree(mode = "classification",
cost_complexity = 0.0001,
tree_depth = 4,
min_n = 10) %>%
set_engine("rpart") %>%
fit(as.factor(fail_in_30) ~ ., data = train_scaled_tbl)
# summary(model_decision_tree)
model_decision_tree$fit %>%
rpart.plot(
roundint = FALSE,
type = 1,
extra = 101,
fallen.leaves = FALSE, #lines are angled
cex = 0.6, #font size
main = "Simple Decision Tree",
box.palette = "Blues"
)
Survival analysis (also called time-to-event analysis or duration analysis) is a branch of statistics for analyzing the expected time until one or more events occur, called survival times or durations. Here, the event of interest is engine failure, and we analyze the time, in cycles, until it happens.
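The Kaplan-Meier estimator behind the mortality table below is simple: at each event time with n_i units still at risk and d_i failures, the survival estimate multiplies by (1 - d_i / n_i). A hand-rolled sketch with toy counts (not a call to survfit()):

```r
# Toy event times: n_i at risk and d_i failures at each time
n_risk  <- c(10, 8, 5)
d_event <- c(2, 3, 1)

# S(t) = product over event times <= t of (1 - d_i / n_i)
surv <- cumprod(1 - d_event / n_risk)
round(surv, 3)  # 0.8 0.5 0.4
```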
survival_tbl <- train_sampled_tbl %>%
mutate(fail_in_30 = as.numeric(case_when(fail_in_30 == 1 ~ 1,
TRUE ~ 0
))) %>%
select(time_cycles, fail_in_30, sensor_11, sensor_4, sensor_7, sensor_12 )
surv_simple_km <- survfit(Surv(time_cycles, fail_in_30) ~ 1,data = survival_tbl)
tidy(surv_simple_km) %>%
head(20) %>%
kbl(caption = "Kaplan Meier Mortality Table, Top 20") %>%
kable_classic(full_width = F, html_font = "Cambria")
| time | n.risk | n.event | n.censor | estimate | std.error | conf.high | conf.low |
|---|---|---|---|---|---|---|---|
| 1 | 6190 | 0 | 20 | 1 | 0 | 1 | 1 |
| 2 | 6170 | 0 | 21 | 1 | 0 | 1 | 1 |
| 3 | 6149 | 0 | 16 | 1 | 0 | 1 | 1 |
| 4 | 6133 | 0 | 12 | 1 | 0 | 1 | 1 |
| 5 | 6121 | 0 | 17 | 1 | 0 | 1 | 1 |
| 6 | 6104 | 0 | 15 | 1 | 0 | 1 | 1 |
| 7 | 6089 | 0 | 12 | 1 | 0 | 1 | 1 |
| 8 | 6077 | 0 | 20 | 1 | 0 | 1 | 1 |
| 9 | 6057 | 0 | 24 | 1 | 0 | 1 | 1 |
| 10 | 6033 | 0 | 21 | 1 | 0 | 1 | 1 |
| 11 | 6012 | 0 | 15 | 1 | 0 | 1 | 1 |
| 12 | 5997 | 0 | 21 | 1 | 0 | 1 | 1 |
| 13 | 5976 | 0 | 19 | 1 | 0 | 1 | 1 |
| 14 | 5957 | 0 | 21 | 1 | 0 | 1 | 1 |
| 15 | 5936 | 0 | 20 | 1 | 0 | 1 | 1 |
| 16 | 5916 | 0 | 19 | 1 | 0 | 1 | 1 |
| 17 | 5897 | 0 | 21 | 1 | 0 | 1 | 1 |
| 18 | 5876 | 0 | 14 | 1 | 0 | 1 | 1 |
| 19 | 5862 | 0 | 14 | 1 | 0 | 1 | 1 |
| 20 | 5848 | 0 | 18 | 1 | 0 | 1 | 1 |
model_coxph <- coxph(Surv(time_cycles, fail_in_30) ~ . , data = survival_tbl)
# Regression Estimates
tidy(model_coxph)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 sensor_11 0.305 0.149 2.04 0.0413
## 2 sensor_4 -0.00522 0.00391 -1.34 0.182
## 3 sensor_7 -0.0142 0.0404 -0.351 0.725
## 4 sensor_12 -0.0482 0.0499 -0.967 0.333
# Mortality Table
model_coxph %>%
survfit() %>%
tidy() %>%
head(20) %>%
kbl(caption = "Cox PH Mortality Table, Top 20") %>%
kable_classic(full_width = F, html_font = "Cambria")
| time | n.risk | n.event | n.censor | estimate | std.error | conf.high | conf.low |
|---|---|---|---|---|---|---|---|
| 1 | 6190 | 0 | 20 | 1 | 0 | 1 | 1 |
| 2 | 6170 | 0 | 21 | 1 | 0 | 1 | 1 |
| 3 | 6149 | 0 | 16 | 1 | 0 | 1 | 1 |
| 4 | 6133 | 0 | 12 | 1 | 0 | 1 | 1 |
| 5 | 6121 | 0 | 17 | 1 | 0 | 1 | 1 |
| 6 | 6104 | 0 | 15 | 1 | 0 | 1 | 1 |
| 7 | 6089 | 0 | 12 | 1 | 0 | 1 | 1 |
| 8 | 6077 | 0 | 20 | 1 | 0 | 1 | 1 |
| 9 | 6057 | 0 | 24 | 1 | 0 | 1 | 1 |
| 10 | 6033 | 0 | 21 | 1 | 0 | 1 | 1 |
| 11 | 6012 | 0 | 15 | 1 | 0 | 1 | 1 |
| 12 | 5997 | 0 | 21 | 1 | 0 | 1 | 1 |
| 13 | 5976 | 0 | 19 | 1 | 0 | 1 | 1 |
| 14 | 5957 | 0 | 21 | 1 | 0 | 1 | 1 |
| 15 | 5936 | 0 | 20 | 1 | 0 | 1 | 1 |
| 16 | 5916 | 0 | 19 | 1 | 0 | 1 | 1 |
| 17 | 5897 | 0 | 21 | 1 | 0 | 1 | 1 |
| 18 | 5876 | 0 | 14 | 1 | 0 | 1 | 1 |
| 19 | 5862 | 0 | 14 | 1 | 0 | 1 | 1 |
| 20 | 5848 | 0 | 18 | 1 | 0 | 1 | 1 |
plot_engine_survival <- function(object_survfit, title) {
tidy(object_survfit) %>%
ggplot(aes(time, estimate)) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.5) +
geom_line(size = 1) +
theme_tq() +
scale_color_tq() +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = title,
x = "Cycles", y = "%Running")
}
survfit_coxph <- survfit(model_coxph)
# note: par(mfrow) does not affect ggplot objects; the two plots print sequentially
par(mfrow = c(2, 1))
plot_engine_survival(surv_simple_km, "Survival Curve by Cycles: Kaplan Meier")
plot_engine_survival(survfit_coxph, "Survival Curve by Cycles: Cox PH")
calculate_classification_metrics <- function(df, pred, actual, cutoff = 0) {
# cutoff = 0 will trigger a for-loop to evaluate many thresholds.
# cutoff > 0 means we want to evaluate a specific cutoff value
# cost for false positives and negatives
fn_cost <- 20
fp_cost <- 0.5
# Set the threshold values
if (cutoff > 0) {
thresh = cutoff
} else {
thresh <- seq(0.005, 0.5, 0.005)
}
# Create an empty dataframe to store the results
results_df <- data.frame(threshold = numeric(length(thresh)),
accuracy = numeric(length(thresh)),
error_rate = numeric(length(thresh)),
sensitivity = numeric(length(thresh)),
specificity = numeric(length(thresh)),
false_positive_rate = numeric(length(thresh)),
false_negative_rate = numeric(length(thresh)),
true_pos = numeric(length(thresh)),
true_neg = numeric(length(thresh)),
false_pos = numeric(length(thresh)),
false_neg = numeric(length(thresh)),
cost = numeric(length(thresh))
)
# Iterate over each threshold value
for (i in seq_along(thresh)) {
# Create a binary prediction based on the threshold
df$predicted <- ifelse(df[[pred]] >= thresh[i], 1, 0)
df[[actual]] <- as.factor(df[[actual]])
levels_actual <- levels(df[[actual]])
df$predicted <- as.factor(df$predicted)
levels_predicted <- levels(df$predicted)
# skip thresholds where predictions collapse to a single class
if (length(levels_actual) != length(levels_predicted))
next
# Create the confusion matrix and calculate metrics
# (actual is passed as `data`, predicted as `reference`,
# so cm$table rows index actual and columns index predicted)
cm <- confusionMatrix(df[[actual]], df$predicted)
# Store the metrics in the results dataframe
results_df[i, "threshold"] <- thresh[i]
results_df[i, "accuracy"] <- cm$overall['Accuracy']
results_df[i, "error_rate"] <- 1 - cm$overall['Accuracy']
# Calculate true positives, true negatives, false positives, and false negatives
# the first number in brackets is the row number, the second is col number
results_df[i, "true_pos"] <- cm$table[2, 2]
results_df[i, "true_neg"] <- cm$table[1, 1]
results_df[i, "false_pos"] <- cm$table[1, 2]
results_df[i, "false_neg"] <- cm$table[2, 1]
# Calculate Sens TP / (TP + FN), Spec TN / (TN + FP)
results_df[i, "sensitivity"] <- results_df[i, "true_pos"] / (results_df[i, "true_pos"] + results_df[i, "false_neg"])
results_df[i, "specificity"] <- results_df[i, "true_neg"] / (results_df[i, "true_neg"] + results_df[i, "false_pos"])
# Calculate false positive rate and false negative rate
results_df[i, "false_positive_rate"] <- results_df[i, "false_pos"] / (results_df[i, "false_pos"] + results_df[i, "true_neg"])
results_df[i, "false_negative_rate"] <- results_df[i, "false_neg"] / (results_df[i, "false_neg"] + results_df[i, "true_pos"])
# Calculate cost
results_df[i, "cost"] <- results_df[i, "false_pos"]*fp_cost + results_df[i, "false_neg"]*fn_cost
}
# Return the results dataframe
return(results_df)
}
set.seed(42)
# model_log_train<-glm( as.factor(fail_in_30) ~ . , family = binomial, data = train_scaled_tbl )
# Remove time_cycles_norm -> insignificant
model_log_train<-glm( as.factor(fail_in_30) ~ . , family = binomial, data = train_scaled_tbl %>% select(-time_cycles_norm) )
summary(model_log_train)
##
## Call:
## glm(formula = as.factor(fail_in_30) ~ ., family = binomial, data = train_scaled_tbl %>%
## select(-time_cycles_norm))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.99226 -0.10307 0.00216 0.17016 2.82208
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.51340 0.06107 -8.407 < 2e-16 ***
## sensor_2 0.49252 0.10274 4.794 1.64e-06 ***
## sensor_3 0.41981 0.09382 4.474 7.66e-06 ***
## sensor_4 1.05486 0.14015 7.527 5.20e-14 ***
## sensor_7 -0.47980 0.13671 -3.510 0.000449 ***
## sensor_8 -0.36853 0.12702 -2.901 0.003715 **
## sensor_11 1.31905 0.16126 8.179 2.85e-16 ***
## sensor_12 -0.73500 0.15430 -4.763 1.90e-06 ***
## sensor_13 -0.37087 0.12491 -2.969 0.002988 **
## sensor_15 0.74934 0.11936 6.278 3.43e-10 ***
## sensor_17 0.61021 0.09689 6.298 3.01e-10 ***
## sensor_20 -0.45538 0.11254 -4.046 5.20e-05 ***
## sensor_21 -0.53256 0.11258 -4.731 2.24e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8581.2 on 6189 degrees of freedom
## Residual deviance: 2065.4 on 6177 degrees of freedom
## AIC: 2091.4
##
## Number of Fisher Scoring iterations: 7
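The coefficients above are on the log-odds scale; predict(type = "response") below applies the inverse logit to convert them to probabilities. A quick illustration using the fitted intercept (-0.513 from the summary above):

```r
# Inverse logit: p = 1 / (1 + exp(-log_odds)), i.e. plogis()
log_odds <- -0.513  # intercept from the summary above
p <- plogis(log_odds)
round(p, 3)  # 0.374
```

So an observation with all scaled predictors at zero gets roughly a 37% predicted failure probability before thresholding.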
rslt_log_train <- train_scaled_tbl %>% select(fail_in_30)
model_log_train_predicted <- predict( model_log_train,type ="response") # converts to prob
rslt_log_train_tbl <- data.frame(model_log_train_predicted,rslt_log_train )
colnames(rslt_log_train_tbl) <- c("probability", "actual")
rslt_log_train_metrics_tbl <- calculate_classification_metrics(df=rslt_log_train_tbl, pred="probability", actual="actual", cutoff = 0)
rslt_log_train_metrics_tbl <- rslt_log_train_metrics_tbl %>% arrange(cost)
cutoff <- rslt_log_train_metrics_tbl$threshold[1]
rslt_log_train_optimal_metrics_tbl <- head(rslt_log_train_metrics_tbl,1)
rslt_log_train_optimal_metrics_tbl <- pivot_longer(rslt_log_train_optimal_metrics_tbl, everything(), names_to = "Metric", values_to = "Value")
rslt_log_train_optimal_metrics_tbl <- rslt_log_train_optimal_metrics_tbl %>%
mutate(Value = round(Value,3))
rslt_log_train_optimal_metrics_tbl %>%
kbl(caption = "Logistic Regression: Training data results") %>%
kable_classic(full_width = F, html_font = "Cambria")
| Metric | Value |
|---|---|
| threshold | 0.045 |
| accuracy | 0.859 |
| error_rate | 0.141 |
| sensitivity | 0.999 |
| specificity | 0.719 |
| false_positive_rate | 0.281 |
| false_negative_rate | 0.001 |
| true_pos | 3091.000 |
| true_neg | 2226.000 |
| false_pos | 869.000 |
| false_neg | 4.000 |
| cost | 514.500 |
rslt_log_valid <- valid_scaled_tbl %>% select(fail_in_30)
model_log_valid_predicted <- predict(model_log_train, newdata = valid_scaled_tbl, type = "response" )
rslt_log_valid_tbl <- data.frame(model_log_valid_predicted, rslt_log_valid )
colnames(rslt_log_valid_tbl) <- c("probability", "actual")
rslt_log_valid_metrics_tbl <- calculate_classification_metrics(df=rslt_log_valid_tbl, pred="probability", actual="actual", cutoff = cutoff)
rslt_log_valid_metrics_tbl <- pivot_longer(rslt_log_valid_metrics_tbl, everything(), names_to = "Metric", values_to = "Value")
rslt_log_valid_metrics_tbl <- rslt_log_valid_metrics_tbl %>%
mutate(Value = round(Value,3))
rslt_log_valid_metrics_tbl %>%
kbl(caption = "Logistic Regression: Validation data results") %>%
kable_classic(full_width = F, html_font = "Cambria")
| Metric | Value |
|---|---|
| threshold | 0.045 |
| accuracy | 0.884 |
| error_rate | 0.116 |
| sensitivity | 0.988 |
| specificity | 0.881 |
| false_positive_rate | 0.119 |
| false_negative_rate | 0.012 |
| true_pos | 328.000 |
| true_neg | 11243.000 |
| false_pos | 1521.000 |
| false_neg | 4.000 |
| cost | 840.500 |
# ROC
valid_ROCprediction<-prediction(model_log_valid_predicted, rslt_log_valid_tbl$actual)
valid_ROCperformance<-performance(valid_ROCprediction, "tpr", "fpr")
plot(valid_ROCperformance, colorize=TRUE)
#AUC
valid_AUC <- unlist(slot(performance(valid_ROCprediction, "auc"), "y.values"))
print(paste("The AUC for validation is:", valid_AUC))
## [1] "The AUC for validation is: 0.985308359731625"
# compare predicted probabilities for engines that actually fail within 30 cycles
train_scaled_fail_predict <- cbind(train_scaled_tbl, model_log_train_predicted) %>%
  select(fail_in_30, model_log_train_predicted) %>%
  filter(fail_in_30 == 1) %>%
  arrange(model_log_train_predicted)
hist(train_scaled_fail_predict$model_log_train_predicted, main = "Train Predicted Prob for Actual Failures", xlab = "", breaks=30)
valid_scaled_fail_predict <- cbind(valid_scaled_tbl,model_log_valid_predicted) %>%
select(fail_in_30, model_log_valid_predicted) %>%
filter(fail_in_30 == 1) %>%
arrange(model_log_valid_predicted)
hist(valid_scaled_fail_predict$model_log_valid_predicted, main = "Validation Predicted Prob for Actual Failures", xlab = "", breaks=30)
# xgboost expects an integer 0/1 label rather than a factor
boost_train_scaled_tbl <- train_scaled_tbl %>%
  mutate(fail_in_30 = as.integer(case_when(fail_in_30 == 1 ~ 1,
                                           TRUE ~ 0)))
boost_valid_scaled_tbl <- valid_scaled_tbl %>%
  mutate(fail_in_30 = as.integer(case_when(fail_in_30 == 1 ~ 1,
                                           TRUE ~ 0)))
y_train <- as.integer(boost_train_scaled_tbl$fail_in_30)
y_test <- as.integer(boost_valid_scaled_tbl$fail_in_30)
X_train <- boost_train_scaled_tbl %>% select(-fail_in_30)
X_test <- boost_valid_scaled_tbl %>% select(-fail_in_30)
xgb_train <- xgb.DMatrix(data = as.matrix(X_train), label = y_train)
xgb_test <- xgb.DMatrix(data = as.matrix(X_test), label = y_test)
params <- list(
  objective = "binary:logistic",
  eval_metric = "error",
  max_depth = 3,             # maximum depth of each tree (default 6)
  early_stopping_rounds = 3  # stop if no improvement in this many rounds
                             # (has no effect inside params -- see the warning below)
)
set.seed(42)
nrounds <- 10 # Set the number of boosting rounds
xgb_model <- xgboost(data = xgb_train, params = params, nrounds = nrounds)
## [20:08:29] WARNING: amalgamation/../src/learner.cc:627:
## Parameters: { "early_stopping_rounds" } might not be used.
##
## This could be a false alarm, with some parameters getting used by language bindings but
## then being mistakenly passed down to XGBoost core, or some parameter actually being used
## but getting flagged wrongly here. Please open an issue if you find any such cases.
##
##
## [1] train-error:0.082714
## [2] train-error:0.079321
## [3] train-error:0.068982
## [4] train-error:0.067690
## [5] train-error:0.065913
## [6] train-error:0.064943
## [7] train-error:0.064459
## [8] train-error:0.065267
## [9] train-error:0.063005
## [10] train-error:0.063005
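The warning above is worth heeding: `early_stopping_rounds` has no effect when tucked inside `params` and passed to `xgboost()` without an evaluation set to watch. A minimal sketch of how early stopping would actually take effect via `xgb.train()` with a watchlist; it uses small synthetic stand-in data, since the scaled sensor tables come from earlier chunks:

```r
library(xgboost)

set.seed(42)
# tiny synthetic stand-in for the scaled sensor matrices used above
X <- matrix(rnorm(2000), ncol = 4)
y <- as.integer(X[, 1] + rnorm(500, sd = 0.5) > 0)
dtrain <- xgb.DMatrix(data = X[1:400, ], label = y[1:400])
dvalid <- xgb.DMatrix(data = X[401:500, ], label = y[401:500])

model <- xgb.train(
  params = list(objective = "binary:logistic",
                eval_metric = "error",
                max_depth = 3),
  data = dtrain,
  nrounds = 50,
  watchlist = list(train = dtrain, valid = dvalid),
  early_stopping_rounds = 3,  # stop once valid-error fails to improve for 3 rounds
  verbose = 0
)
model$best_iteration  # the round with the lowest validation error
```

Passing `early_stopping_rounds` as a direct argument, together with a `watchlist` containing a held-out set, is what lets the booster monitor validation error and stop adding trees once it plateaus.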
# get information on how important each feature is
importance_matrix <- xgb.importance(names(X_train), model = xgb_model)
xgb.plot.importance(importance_matrix)
# generate predictions for training
xgb_pred_train <- predict(xgb_model, xgb_train)
# create df with predictions and the actual values
xgb_train_tbl <- data.frame(xgb_pred_train,rslt_log_train)
colnames(xgb_train_tbl) <- c("probability", "actual")
# evaluate cost across candidate thresholds; cutoff = 0 is a placeholder that triggers the grid search
rslt_xgb_train_metrics_tbl <- calculate_classification_metrics(df = xgb_train_tbl, pred = "probability", actual = "actual", cutoff = 0)
rslt_xgb_train_metrics_tbl <- rslt_xgb_train_metrics_tbl %>%
  # drop the placeholder default of 0 (the helper skips thresholds that don't yield two classes)
  filter(threshold > 0) %>%
  arrange(cost)
# keep the lowest-cost threshold to reuse on the validation data
xgb_cutoff <- rslt_xgb_train_metrics_tbl$threshold[1]
rslt_xgb_train_optimal_metrics_tbl <- head(rslt_xgb_train_metrics_tbl, 1)
rslt_xgb_train_optimal_metrics_tbl <- pivot_longer(rslt_xgb_train_optimal_metrics_tbl, everything(), names_to = "Metric", values_to = "Value")
rslt_xgb_train_optimal_metrics_tbl <- rslt_xgb_train_optimal_metrics_tbl %>%
mutate(Value = round(Value,3))
rslt_xgb_train_optimal_metrics_tbl %>%
kbl(caption = "xgboost: Training data results") %>%
kable_classic(full_width = F, html_font = "Cambria")
| Metric | Value |
|---|---|
| threshold | 0.110 |
| accuracy | 0.892 |
| error_rate | 0.108 |
| sensitivity | 0.999 |
| specificity | 0.784 |
| false_positive_rate | 0.216 |
| false_negative_rate | 0.001 |
| true_pos | 3093.000 |
| true_neg | 2428.000 |
| false_pos | 667.000 |
| false_neg | 2.000 |
| cost | 373.500 |
# generate predictions
xgb_pred_test <- predict(xgb_model, xgb_test)
# create df with predictions and the actual values
xgb_test_tbl <- data.frame(xgb_pred_test,rslt_log_valid)
colnames(xgb_test_tbl) <- c("probability", "actual")
# get cost based on different thresholds
rslt_xgb_test_metrics_tbl <- calculate_classification_metrics(df=xgb_test_tbl, pred="probability", actual="actual", cutoff = xgb_cutoff)
rslt_xgb_test_metrics_tbl <- pivot_longer(rslt_xgb_test_metrics_tbl, everything(), names_to = "Metric", values_to = "Value")
rslt_xgb_test_metrics_tbl <- rslt_xgb_test_metrics_tbl %>%
mutate(Value = round(Value,3))
rslt_xgb_test_metrics_tbl %>%
kbl(caption = "xgboost: Validation data results") %>%
kable_classic(full_width = F, html_font = "Cambria")
| Metric | Value |
|---|---|
| threshold | 0.110 |
| accuracy | 0.917 |
| error_rate | 0.083 |
| sensitivity | 0.970 |
| specificity | 0.916 |
| false_positive_rate | 0.084 |
| false_negative_rate | 0.030 |
| true_pos | 322.000 |
| true_neg | 11692.000 |
| false_pos | 1072.000 |
| false_neg | 10.000 |
| cost | 736.000 |
Both machine learning models provide substantial value: on the validation data, each cuts the expected misclassification cost by roughly a factor of eight relative to a naive baseline that performs no preventive maintenance, as the comparison below quantifies.
fn_cost <- 20
valid_fail <- valid_scaled_tbl %>%
  select(fail_in_30) %>%
  mutate(fail_in_30_num = as.numeric(case_when(fail_in_30 == 1 ~ 1,
                                               TRUE ~ 0)))
# naive baseline: with no model, every actual failure becomes a missed (false negative) prediction
valid_naive_cost <- sum(valid_fail$fail_in_30_num) * fn_cost
valid_logistic_cost <- rslt_log_valid_metrics_tbl %>%
  filter(Metric == 'cost') %>%
  select(Value)
valid_logistic_cost <- valid_logistic_cost$Value[1]
valid_xgb_cost <- rslt_xgb_test_metrics_tbl %>%
  filter(Metric == 'cost') %>%
  select(Value)
valid_xgb_cost <- valid_xgb_cost$Value[1]
paste0("The naive approach cost is ", valid_naive_cost,
       ", the logistic regression cost was ", valid_logistic_cost,
       ", and the xgboost cost was ", valid_xgb_cost)
## [1] "The naive approach cost is 6640, the logistic regression cost was 840.5, and the xgboost cost was 736"
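For reference, the cost figures in the tables above follow a simple linear model: each false negative costs `fn_cost = 20`, and each false positive costs 0.5 (the false-positive cost is inferred from the reported cost values rather than stated in this chunk). A minimal sketch that reproduces the validation comparison directly from the confusion counts:

```r
# Linear misclassification cost: missed failures (FN) are far costlier than false alarms (FP).
# fn_cost = 20 comes from the script; fp_cost = 0.5 is inferred from the reported costs.
misclass_cost <- function(false_pos, false_neg, fp_cost = 0.5, fn_cost = 20) {
  false_pos * fp_cost + false_neg * fn_cost
}

misclass_cost(1521, 4)   # logistic regression, validation: 840.5
misclass_cost(1072, 10)  # xgboost, validation: 736
misclass_cost(0, 332)    # naive baseline misses all 332 true failures: 6640
```

The 40:1 asymmetry between the two error costs is what pushes the optimal thresholds (0.045 and 0.110) so far below 0.5: tolerating many false alarms is cheap insurance against a single missed failure.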